We begin with SynMap output based on the Schnable 2011 paper (see my notes for details on parameter and download options).
Expression data was retrieved from NCBI: GSE50191 and was reported in Walley 2016. This set includes 69 experiments (3 biological replicates, 23 tissues) for 62,547 mRNAs aligned to Zea mays B73 RefGen_v2 5a WGS annotation set.
Protein abundance data was retrieved from Walley 2016 supplamental table S2. This dataset contains 3 to 7 biological replicates over 33 tissues. Expression data is based on a subset of this protein data tissues. Values are in dNASF, which is summarized here.
We parse the SynMap output file with ks/kn values (i.e. a DAGChainer file) to a flat format using a perl script (parseKsKnFile.pl) and import the data into R. Some ks and kn values are empty, which is to be expected. These are ignored for the purposes of finding the median, but included in the results if they are in a block that meets the cutoff criteria. GO terms related to Maize gene models are imported as well.
Lets take a look at what we just imported.
We need to select a cutoff value for ks (synonymous mutation rate) to differentiate between the most recent alpha duplication event and the previous beta duplication. Looking at ks value frequency (left), we should not see any obvious cutoff point. By transforming ks values to block median values, we should see two (or more) peaks (center). In this (center) graph, the first peak represents orthologs and the second represents homeologs from “pregrass tetraploidy”. If the vertical line separates the first two peaks sufficently, it is probably a good enough value. If not, we need to pick a new cutoff value for ks. Viewing the log10 transform of the ks values helps differentiate types of homologs (better version of this graph with color can be generated in CoGe).
You have selected the log10(ks) cutoff value to be 0. Again, this should ideally separate two peaks in the middle graph.
Regions of synteny are not confined to a 1:1 relationship between Sorghum and Maize chromosomes. Where multiple regions in Maize are syntenic to the same region on Sorghum, we can assume this is a result of the whole genome duplication even. These are the regions which will be sorted into subgenome 1 and subgenome 2.
| Chromosome | b34889_1 | b34889_2 | b34889_3 | b34889_4 | b34889_5 | b34889_6 | b34889_7 | b34889_8 | b34889_9 | b34889_10 | total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| a31607_1 | 2431 | na | na | na | 710 | na | na | na | 736 | na | 3877 |
| a31607_2 | na | 895 | na | na | na | na | 1532 | na | na | na | 2427 |
| a31607_3 | na | na | 1914 | na | na | na | na | 1174 | na | na | 3088 |
| a31607_4 | na | na | na | 893 | 1475 | na | na | na | na | na | 2368 |
| a31607_5 | na | 114 | na | 296 | na | na | na | na | na | na | 410 |
| a31607_6 | na | 1055 | na | na | na | na | na | na | na | 747 | 1802 |
| a31607_7 | 486 | na | na | 412 | na | 108 | na | na | na | 164 | 1170 |
| a31607_8 | 206 | na | 192 | 15 | na | 21 | na | na | na | 229 | 663 |
| a31607_9 | na | na | na | na | na | 765 | na | 419 | na | 58 | 1242 |
| a31607_10 | na | na | na | na | 202 | 406 | na | na | 700 | na | 1308 |
There are 0 unaligned sorghum chromosome(s) in the above table. If there are any unaligned sorghum chromosomes, consider picking a new ks cutoff.
In order to separate subgenome 1 from subgenome 2, we implement a greedy algorithm to collect non-overlapping (when projected on sorghum) syntenic blocks by size. Psuedocode of the algorithm is as follows: foreach sorghum chromosome, set the largest syntenic chromosome (by gene) as sub1. Foreach additional syntenic chromosome if doesn’t overlap what is already in sub1, add it to sub1. Else, if it does not overlap anything in sub2, add to sub2. Else toss. After all syntenic chromosomes are process for this sorghum chromosome, count the number of genes in sub1 and sub2. If sub2 is bigger, switch them. *Toss out “small” syntenic chromosomes (51 genes or less?), and consider a tolerance for how much overlap must occur to consider it a true overlap. Also consider switching to gene level overlap rather than whole-chromosome start/stop values.
In order to see if one subgenome tends to have more isoforms, we plot the density of isoform counts from each subgenome based on isoforms in the v4 release 32 GFF file. We scale the x axis using log10, as we have a very long tail on this data (i.e. most counts are 1, but a few range to nearly 600). The densities track each other very strongly, so there isn’t likely anything to look for here.
How many chromosomes with synteny to sorghum were correctly placed in subgenome 1 or subgenome 2? Items in “Paper” only are missed subgenome assignments. Items in “Greedy” only are false assignments. Items in center are correct. (currently sorted by hand)
First we show both how many unique GO terms have been assigned using a computational ev-code vs. an experimental ev-code. A few terms have been annotated using both or were assigned multiple times based on different publications, which is why n is greater than the number of unique GO terms in our set. We also break down the GO terms by type. A few terms may not have a type, typically because they have been made obsolete.
Second, we look at the same divisions but this time for all gene annotations (i.e. gene + GO Term + Ev-Code). Again, since some terms are assigned computationally and experimentally to the same gene or assigned multiple times based on different publications, n is greater than the number of unique GO term assignments in out set. The main takeaways here are that computationally assigned GO terms vastly outnumber experimental annotations, most annotations use MF terms, and there are a few terms falling through the cracks most likely due to being outdated.
GO Terms by subgenome. If the GO Term is assigned to any gene in the subgenome, it is counted once for that subgenome. This is a presence/absence test.
The main takeaways here are that 1) there are differences in functional assignments between each subgenome, regardless of which filters I apply. 2) the GOSlim for plants is extremely restrictive, and probably not useful. 3) Sub1 has more functional assignments than sub2, although this may be a direct correlation to the larger size (by definition) of sub1.
GO Term assignments to genes by subgenome, which takes into account the gene-GO Term pairings and considers homeologs equivalent for the purposes of determining overlap in assignments.
The main takeaways here mirror the assessment from the GO Term presence/absence check above. There are many differences in assignments, although we can assume that the same terms are reused very often based on the differenced between unique terms and gene annotations.
Cases where the maize1 homolog is more or less expressed than the maize2 homolog
Overall expression patterns between subgenome 1 and subgenome 2 are highly similar. Below we show to views of the same data. In the density graph, we can see that subgenome 1 is slightly shifted to the right of subgenome 2 expression, but otherwise follows the same shape. Viewed as a boxplot, you can again see that the quartiles in subgenome 1 and 2 are largely the same. This also shows a few high expressing outliers on subgenome 1.
Paired (two-sided) t-test suggests that there is no difference in means between FPKM of homeologs in sub1 vs. sub2 (p-val > 0.05, so we fail to reject null hypothesis of equal means):
##
## Paired t-test
##
## data: expressedPairs$FPKM_mean1 and expressedPairs$FPKM_mean2
## t = 1.769, df = 2853, p-value = 0.077
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3611981 7.0253531
## sample estimates:
## mean of the differences
## 3.332077
I am curious if our definition of sub 1 and sub 2 (i.e. the larger syntenic block is defined as sub 1) is really the best way to decide which genes are in sub 1 and sub 2. It seems to me that if one could resonably expect that, given 2 identically duplicate genes, if there was no benefit to having duplicated function then it would be a matter of random chance which of the two duplicate genes would loose function over time. It also seems to me that the quickest way to “degrade” an unnecessary gene (evolutionarily speaking) is to reduce expression. So what happens if we pick the higher expressing gene to be in sub 1 and the lower expressing gene to be in sub 2? Not surprising, we can find differences in populations between high and low expressing pairs. I wonder if this holds across different maize lines?
In all cases where there is a homeolog on both maize1 and maize2, the we want to look to several measures to see if the genes on one subgenome tend to be present in greater abundance compared to the other. We can define this comparison as either greater expression, greater protein abundance, or a greater number of isoforms available to the gene. We can compare using a factor to define a cutoff for “greater”. (i.e. 4-fold higher expression/abundance etc.).
First we look at expression values. We find that maize1 consistently overexpresses the maize2 pair on average across all available data, regardless of factor (1,2, or 4), and regardless of whether or not we allow NA values to be treated as ‘0’.
Fig 9.1:
This same pattern holds for protein abundance data as well. Here we look at protein abundance data for the tissues that also have expression data.
Fig 9.2:
We can see this same trend for all experiments across all factors (fold-difference of 1 to 9 shown). The blue line (neither gene has greater expression) grows as factor increases since we have a stricter and stricter definition of what counts as greater expression. Red (maize1) is always above green (maize2) indicating that more maize1 homeologs have greater expression over their maize2 pair than vice versa for all expriments and all factors. Red and green lines both tend to zero when factor is large enough. Protein data (not shown) follows this same pattern.
Fig 9.3:
The number of isoforms each gene of the homeolog pairs is thought to express follows a pattern where maize1 tends to have more isoforms than maize2 homeologs.
Fig 9.4:
Looking at a few individual cases (* these are calculated as maize1 > maize2, since maize2 > maize1 makes a negative fold change):
| Maize1_url | Maize2_url | Sample | Value_maize1 | Value_maize2 | foldChange |
|---|---|---|---|---|---|
| Zm00001d039733 | Zm00001d008668 | Leaf Zone 2 (Stomatal) | 4084.690 | 1.582643 | 11.333675 |
| Zm00001d039733 | Zm00001d008668 | Ear Primordium 2-4 mm | 3159.010 | 1.392160 | 11.147932 |
| Zm00001d039733 | Zm00001d008668 | Leaf Zone 3 (Growth) | 2335.750 | 1.137135 | 11.004267 |
| Zm00001d039733 | Zm00001d008668 | Ear Primordium 6-8 mm | 2839.653 | 1.534157 | 10.854053 |
| Zm00001d039733 | Zm00001d008668 | Embryo 20 DAP | 1904.780 | 1.039390 | 10.839672 |
| Zm00001d044122 | Zm00001d011438 | Silk | 2306.653 | 1.317030 | 10.774297 |
| Zm00001d039733 | Zm00001d008668 | 7-8 internode | 1749.163 | 1.225893 | 10.478616 |
| Zm00001d039733 | Zm00001d008668 | Leaf Zone 1 (Symmetrical) | 2157.867 | 1.734310 | 10.281028 |
| Zm00001d045431 | Zm00001d036086 | Root - Cortex 5 Days | 1173.249 | 1.197500 | 9.936267 |
| Zm00001d039733 | Zm00001d008668 | 6-7 internode | 1390.567 | 1.581110 | 9.780520 |
| Maize1_url | Maize2_url | Sample | Value_maize1 | Value_maize2 | foldChange |
|---|---|---|---|---|---|
| Zm00001d045431 | Zm00001d036086 | Silk | 56580.732 | 35.635414 | 10.632783 |
| Zm00001d045431 | Zm00001d036086 | Endosperm Crown 27 DAP | 71582.340 | 94.567236 | 9.564047 |
| Zm00001d032748 | Zm00001d014063 | Leaf Zone 2 (Stomatal) | 1691.444 | 2.484244 | 9.411233 |
| Zm00001d045431 | Zm00001d036086 | 7-8 internode | 22308.627 | 49.224767 | 8.824002 |
| Zm00001d045431 | Zm00001d036086 | Secondary Root 7-8 Days | 46608.118 | 161.463500 | 8.173229 |
| Zm00001d033850 | Zm00001d013367 | Ear Primordium 6-8 mm | 99061.113 | 356.606790 | 8.117841 |
| Zm00001d045431 | Zm00001d036086 | Primary Root 5 Days | 47139.332 | 173.503759 | 8.085821 |
| Zm00001d032748 | Zm00001d014063 | Leaf Zone 3 (Growth) | 1399.480 | 5.360648 | 8.028267 |
| Zm00001d039865 | Zm00001d008619 | Embryo 38 DAP | 23635.479 | 100.652210 | 7.875431 |
| Zm00001d021702 | Zm00001d006539 | Pericarp/Aleurone 27 DAP | 7784.268 | 33.973148 | 7.840023 |
| Maize1_url | Maize2_url | Sample | foldChange_expr | foldChange_prot |
|---|---|---|---|---|
| Zm00001d045431 | Zm00001d036086 | Root - Cortex 5 Days | 9.936267 | 7.539404 |
| Zm00001d045431 | Zm00001d036086 | Root - Meristem Zone 5 Days | 9.724529 | 7.200059 |
| Zm00001d045431 | Zm00001d036086 | Root - Elongation Zone 5 Days | 9.563307 | 6.646152 |
| Zm00001d045431 | Zm00001d036086 | 6-7 internode | 8.231741 | 7.013317 |
| Zm00001d045431 | Zm00001d036086 | 7-8 internode | 7.930961 | 8.824002 |
Look at the top 10 pairs with highest isoform count diffs (foldChange in log2)
| Maize1_url | Maize2_url | Value_maize1 | Value_maize2 | foldChange |
|---|---|---|---|---|
| Zm00001d044186 | Zm00001d011386 | 315 | 5 | 5.977280 |
| Zm00001d040748 | Zm00001d009220 | 92 | 2 | 5.523562 |
| Zm00001d028579 | Zm00001d047831 | 34 | 2 | 4.087463 |
| Zm00001d043900 | Zm00001d011572 | 27 | 2 | 3.754888 |
| Zm00001d040279 | Zm00001d008338 | 23 | 2 | 3.523562 |
| Zm00001d038190 | Zm00001d010282 | 30 | 3 | 3.321928 |
| Zm00001d027848 | Zm00001d048318 | 18 | 2 | 3.169925 |
| Zm00001d019087 | Zm00001d007770 | 18 | 2 | 3.169925 |
| Zm00001d040193 | Zm00001d008398 | 86 | 10 | 3.104337 |
| Zm00001d017798 | Zm00001d051590 | 46 | 6 | 2.938600 |
Here is a scatter plot of isoform counts across gene pairs.
Fig 9.5:
Fig 9.5:
Fig 9.6:
Red Square = Zm00001d020592 Pericarp/Aleurone 27 DAP vs. Zm00001d005793
Yellow Square = Zm00001d003188 Pericarp/Aleurone 27 DAP vs. Zm00001d025753
Yellow X = Zm00001d039040 Mature Leaf 8
Pink Cross = Zm00001d015385 Mature Leaf 8
Report generated on: December 01, 2017